Some words to start with

Welcome to the page! This is an essential part of my final assignment for the course “Introduction to Open Data Science”, or “IODS” as we friends call it. My name is Laura Matkala and I am a PhD student who studies forests. I have to say this is one of the most inspiring courses I have taken in a while. I will do my best to use all the new skills I have learned during the course to make the best possible outcome of this assignment!

knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/49.jpg')
Happy forest scientist by a lake in a mountain forest at Mammoth Lakes, CA, USA.(This is here to remind us that although it doesn't look like it now, the sun actually does exist...)


About the dataset

I chose the dataset Boston, which contains data about housing in the suburbs of Boston, Massachusetts, USA. I will later perform linear regression and logistic regression on the variable “crim”, but first some basic information about the dataset.

knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/boston.jpg')
The dataset has variables related to housing in the suburbs of Boston, Massachusetts, USA. Picture from: http://amtrakdowneaster.com/stations/boston


I didn’t do any data wrangling for the linear regression part, but I did some for the latter logistic regression part. You can find the R script file with all the data wrangling code here. The variables in the dataset are:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil–teacher ratio by town
- black: 1000(Bk − 0.63)², where Bk is the proportion of black residents by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
- crime: a categorical version of crim that I created in the data wrangling, with levels high, low, med_high and med_low

Analysis

Linear regression

Data exploration

To start the linear regression I need to read in the data and load the needed packages. I will also check the structure and dimensions of the data to make sure everything is in order after the data wrangling and saving the file as csv.

Boston <- read.csv(file = "C:/HY-Data/MATKALA/GitHub/IODS-final/Boston.csv", header = TRUE, sep = ",")
library(GGally); library(ggplot2)
str(Boston)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ crime  : Factor w/ 4 levels "high","low","med_high",..: 2 2 2 2 2 2 4 4 4 4 ...
dim(Boston)
## [1] 506  15

Everything seems to be ok with the data, and it looks the way I intended at this point. Let’s make a couple of plots to see what the data looks like.

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

Ok, we see that “rad”, “tax” and “indus” have the highest correlations with “crim”. I will thus use these three as explanatory variables in a multiple linear regression model for “crim”. A multiple linear regression model in this case takes the form \(y = \alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\epsilon\), where \(x_1\), \(x_2\) and \(x_3\) are rad, tax and indus, the \(\beta\)s are their coefficients, \(\alpha\) is the intercept and \(\epsilon\) is the error term/random noise.

Building the model

Let’s start modeling!

my_model <- lm(crim ~ rad + tax + indus, data = Boston)
summary(my_model)
## 
## Call:
## lm(formula = crim ~ rad + tax + indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.180  -1.457  -0.248   0.763  76.418 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.048463   1.142904  -2.667  0.00789 ** 
## rad          0.563131   0.084871   6.635 8.42e-11 ***
## tax          0.001631   0.005083   0.321  0.74840    
## indus        0.055531   0.064352   0.863  0.38859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.72 on 502 degrees of freedom
## Multiple R-squared:  0.3932, Adjusted R-squared:  0.3896 
## F-statistic: 108.4 on 3 and 502 DF,  p-value: < 2.2e-16

In linear regression the aim is to minimize the residuals, which are the prediction errors of the model: the best fit is the one that minimizes the sum of the squared residuals. The \(\beta\) estimates this model gives us are 0.56, 0.002 and 0.06, respectively, and the \(\alpha\) (intercept) is -3.0. The standard error for \(\alpha\) is 1.1 and for the \(\beta\)s 0.08, 0.005 and 0.06, respectively. Based on the p-values it seems that only “rad” is important for the model. I will thus leave “tax” and “indus” out and formulate another model.
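As a quick sanity check on the least-squares idea, here is a small sketch on simulated data (the variable names are made up for illustration): lm() returns exactly the solution of the normal equations, i.e. the coefficients that minimize the sum of squared residuals.

```r
# Simulate data where y depends linearly on x1 and x2 plus noise
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- -3 + 0.5 * x1 + 0.1 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Least-squares solution from the normal equations: (X'X)^-1 X'y
X    <- cbind(1, x1, x2)
beta <- solve(t(X) %*% X, t(X) %*% y)

max(abs(coef(fit) - as.vector(beta)))  # essentially zero
```

Both approaches give the same coefficients, which is reassuring: the summary tables above really are the least-squares fit.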

my_model2 <- lm(crim ~ rad, data = Boston)
summary(my_model2)
## 
## Call:
## lm(formula = crim ~ rad, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

Well, no dramatic changes in the residuals compared to the previous model. The standard error and p-value of “rad” got smaller, though, and the same happened to the intercept, so in this sense the latter model is better than the first one. Anyway, I come to the same conclusion as when I looked into this dataset in the LDA exercise: with a higher index of accessibility to radial highways it is easier to reach a highway, and thus to escape after committing a crime.
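One way to back up this model comparison more formally is a nested-model F-test with anova(). A sketch, assuming the Boston data is available via the MASS package (so without my extra “crime” column):

```r
library(MASS)  # provides the Boston dataset

full    <- lm(crim ~ rad + tax + indus, data = Boston)
reduced <- lm(crim ~ rad, data = Boston)

# F-test of the reduced model against the full one; a large p-value
# means that dropping "tax" and "indus" does not significantly
# worsen the fit
anova(reduced, full)
```

With the p-values of “tax” and “indus” as large as they are in the first summary, this joint test should also be far from significant, supporting the simpler model.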

Logistic regression

The model and its odds ratio

It’s time to do another kind of analysis on the same data. I chose logistic regression. I will use the same explanatory variables as in the linear regression and formulate the hypotheses based on the linear regression results. However, for logistic regression I am using the dependent variable “crime” instead of “crim”, since the dependent variable in logistic regression needs to be categorical. So “crime” is categorical, but otherwise basically the same as “crim”. As I mentioned earlier, here you can see how I mutated the dependent variable.

My hypotheses are:

1. Index of accessibility to radial highways affects per capita crime rate.

2. Full-value property-tax rate does not affect per capita crime rate.

3. Proportion of non-retail business acres does not affect per capita crime rate.

And here is the model, its summary and coefficients.

m <- glm(crime ~ rad + tax + indus, data = Boston, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = crime ~ rad + tax + indus, family = "binomial", 
##     data = Boston)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.01639  -0.22661   0.02370   0.04368   2.48870  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.785353   3.590682   3.004 0.002667 ** 
## rad         -0.455414   0.132091  -3.448 0.000565 ***
## tax          0.002602   0.007998   0.325 0.744938    
## indus       -0.256292   0.176308  -1.454 0.146041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 570.179  on 505  degrees of freedom
## Residual deviance:  59.836  on 502  degrees of freedom
## AIC: 67.836
## 
## Number of Fisher Scoring iterations: 10
coef(m)
##  (Intercept)          rad          tax        indus 
## 10.785353478 -0.455413629  0.002601907 -0.256291520

Since the explanatory variables are not categorical, all the necessary information is shown above. According to the summary table, “rad” is once again the only explanatory variable the model needs, which supports hypotheses 2 and 3. I will therefore leave the other variables out, adjust the model to its final form, show its summary and present its odds ratios.

m2 <- glm(crime ~ rad, data = Boston, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = crime ~ rad, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4429  -0.2225   0.0577   0.0731   2.4836  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.28246    1.29238   6.409 1.47e-10 ***
## rad         -0.47166    0.05816  -8.109 5.11e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 570.179  on 505  degrees of freedom
## Residual deviance:  62.752  on 504  degrees of freedom
## AIC: 66.752
## 
## Number of Fisher Scoring iterations: 8
library(tidyr)

OR <- coef(m2) %>% exp
CI <- confint(m2) %>% exp
cbind(OR, CI)
##                       OR       2.5 %       97.5 %
## (Intercept) 3953.9201755 573.4230913 1.548093e+05
## rad            0.6239669   0.5330939 6.840731e-01
knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/holy_moly.jpg')

What on earth is going on with the intercept odds ratio?!? How come it is so high? It is simply the odds when all explanatory variables are zero, a value of “rad” that never occurs in the data, which explains the huge number. The odds ratio of “rad” is approximately 0.62, which means, if I understood correctly, that each one-unit increase in “rad” multiplies the odds of the modelled outcome by about 0.62 (so it decreases them, rather than there being a “60 % chance” of anything). But before I can say all my hypotheses are confirmed I need to look into the predictive power of the model.
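A quick sanity check on these numbers, using the coefficients copied from the model summary above: the odds ratios are just exp() of the coefficients, and the inverse logit (plogis) turns the linear predictor into a probability.

```r
# Coefficients copied from the summary of m2 above
a <- 8.28246   # intercept, on the log-odds scale
b <- -0.47166  # coefficient of rad, on the log-odds scale

exp(a)  # intercept odds ratio, roughly 3954 as printed above
exp(b)  # rad odds ratio, roughly 0.62

# One more unit of rad multiplies the odds by exp(b), i.e. shrinks
# them by almost 40 %. The predicted probability at a given rad
# comes from the inverse logit:
plogis(a + b * 1)   # rad = 1
plogis(a + b * 24)  # rad = 24, the maximum in the data
```

So the intercept odds ratio is nothing mysterious, just exp(8.28), and the predicted probability drops steadily as “rad” grows.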

The predictive power

library(dplyr)
probabilities <- predict(m2, type = "response")
Boston <- mutate(Boston, probability = probabilities)
Boston <- mutate(Boston, prediction = probability > 0.5)
select(Boston, rad, crime, probability, prediction) %>% tail(10)
##     rad    crime probability prediction
## 497   6 med_high   0.9957328       TRUE
## 498   6 med_high   0.9957328       TRUE
## 499   6  med_low   0.9957328       TRUE
## 500   6  med_low   0.9957328       TRUE
## 501   6  med_low   0.9957328       TRUE
## 502   1      low   0.9995948       TRUE
## 503   1      low   0.9995948       TRUE
## 504   1      low   0.9995948       TRUE
## 505   1  med_low   0.9995948       TRUE
## 506   1      low   0.9995948       TRUE
table(crime = Boston$crime, prediction = Boston$prediction)
##           prediction
## crime      FALSE TRUE
##   high       126    1
##   low          0  127
##   med_high     6  120
##   med_low      0  126

Well, it looks like the model predicts FALSE for exactly the “high” category and TRUE for everything else?!? Or what? I suspect the four-level “crime” factor is the problem: a binomial model really needs a two-level response.
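A likely culprit: when glm() with family = "binomial" gets a factor with more than two levels, R treats the first level (here “high”) as failure and all the other levels as success, so the model above actually predicts the probability that crime is not “high”. That is exactly what the prediction table shows. A sketch of a genuinely binary alternative, assuming the original Boston data from the MASS package (the median cut point here is my own choice for illustration, not necessarily the one used in my data wrangling):

```r
library(MASS)  # the original Boston data, without my extra columns

# A hypothetical binary response: is the crime rate above the median?
Boston$high_crime <- factor(Boston$crim > median(Boston$crim))

m_bin <- glm(high_crime ~ rad, data = Boston, family = "binomial")
summary(m_bin)$coefficients
```

With a two-level response like this, the coefficient of “rad” comes out positive, as the linear regression results would lead us to expect: better highway access goes together with higher crime rates.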